Dataset CSV file: diabetes_012_health_indicators_BRFSS2015.csv
Group No.: 18
Group Members:
VARINDER SINGH - 2021fc04070@wilp.bits-pilani.ac.in
BANDARU RAJA SEKHAR - 2021fc04074@wilp.bits-pilani.ac.in
MIKHIL. P.A. - 2021fc04326@wilp.bits-pilani.ac.in
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
raw_df = pd.read_csv("diabetes_012_health_indicators_BRFSS2015.csv")
raw_df
raw_df.info()
raw_df.describe()
raw_df.head(2)
raw_df.shape
print("Number of Duplicates before processing the dataset: ", raw_df.duplicated().sum())
raw_df = raw_df.drop_duplicates(keep='last')
raw_df.reset_index(inplace = True, drop = True)
print("Number of Duplicates after processing the dataset: ", raw_df.duplicated().sum())
raw_df.shape
Imbalanced data refers to datasets where the target classes have an uneven distribution of observations, i.e., one class label has a very high number of observations while another has very few.
The given diabetes dataset exhibits an extreme class imbalance.
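As a quick illustration of how such a skew can be quantified with `value_counts`, here is a toy target column (the counts below are made up for illustration, not the BRFSS figures):

```python
import pandas as pd

# Toy target with a heavy skew toward class 0.0, mimicking the shape
# of the problem (counts are illustrative, not the real data).
target = pd.Series([0.0] * 900 + [1.0] * 40 + [2.0] * 60, name="Diabetes_012")

counts = target.value_counts()                # absolute counts per class
shares = target.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares.round(2))  # class 0.0 dominates with 90% of observations
```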
plt.figure(figsize=(8, 4))
ax = sns.countplot( x="Diabetes_012", data=raw_df )
for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x() + 0.45, p.get_height() + 0.55))
Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset. This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model. It might be useful to tune the target class distribution.
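A minimal sketch of random oversampling using `sklearn.utils.resample` on a toy frame (an alternative to the `DataFrame.sample(replace=True)` approach used below; toy data, not the BRFSS dataset):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 6 majority rows (y=0) vs 2 minority rows (y=1).
df = pd.DataFrame({"x": range(8), "y": [0, 0, 0, 0, 0, 0, 1, 1]})

majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Randomly duplicate minority rows (sampling with replacement)
# until the class matches the majority class size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_up]).reset_index(drop=True)
print(balanced["y"].value_counts())  # both classes now have 6 rows
```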
shuffled_df = raw_df.sample( frac=1 ,random_state=42)
zero_df = shuffled_df.loc[shuffled_df['Diabetes_012'] == 0.0].sample(n=19000, replace=True)
one_df = shuffled_df.loc[shuffled_df['Diabetes_012'] == 1.0].sample(n=19000, replace=True)
two_df = shuffled_df.loc[shuffled_df['Diabetes_012'] == 2.0].sample(n=19000, replace=True)
balanced_df = pd.concat([zero_df, one_df, two_df])
balanced_df.reset_index(inplace = True, drop = True)
plt.figure(figsize=(8, 4))
ax = sns.countplot( x="Diabetes_012", data=balanced_df )
for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()), (p.get_x() + 0.45, p.get_height() + 0.55))
Here, we sample each of the target classes (0.0, 1.0, and 2.0) to 19,000 observations apiece, yielding a balanced dataset. Classes with more than 19,000 observations are undersampled, and those with fewer are oversampled (hence replace=True).
from pandas.plotting import scatter_matrix
scatter_matrix(balanced_df, figsize=(50, 50))
plt.show()
corr = balanced_df.corr()
plt.figure(figsize=(20, 20))
mask = np.triu( np.ones_like(corr) )
hm = sns.heatmap( corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='Spectral' )
hm.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12)
No pair of attributes is highly correlated with another, so there is no correlation-based reason to drop or transform any feature. We therefore retain all columns as provided.
balanced_df.info()
col = list(balanced_df.columns)
warnings.filterwarnings('ignore')
plt.figure(figsize=(30,50))
num = 1
for i in col:
    plt.subplot(11, 4, num)
    sns.histplot(balanced_df[i], kde=True)  # distplot is deprecated in recent seaborn
    num = num + 1
    plt.subplot(11, 4, num)
    sns.boxplot(x=balanced_df[i])
    num = num + 1
balanced_df.boxplot(figsize = (30,10), column = col)
Winsorization of Outliers is the process of replacing the extreme values of statistical data in order to limit the effect of the outliers on the calculations or the results obtained by using that data.
for i in col:
    percentile25 = balanced_df[i].quantile(0.25)
    percentile75 = balanced_df[i].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5 * iqr
    lower_limit = percentile25 - 1.5 * iqr
    balanced_df[i] = np.where(balanced_df[i] >= upper_limit, upper_limit,
                     np.where(balanced_df[i] <= lower_limit, lower_limit,
                              balanced_df[i]))
balanced_df.boxplot(figsize = (30,10), column = col)
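For reference, the nested `np.where` capping used in the loop above can be written equivalently with `Series.clip`; a self-contained sketch on toy data (not the project code):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # here: -1.0 and 7.0

# Nested np.where, as in the loop above...
capped_where = pd.Series(np.where(s >= upper, upper,
                         np.where(s <= lower, lower, s)))
# ...and the one-liner equivalent.
capped_clip = s.clip(lower=lower, upper=upper)

assert (capped_where.values == capped_clip.values).all()
print(capped_clip.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```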
Standard Scaler helps to get standardized distribution, with a zero mean and standard deviation of one (unit variance). It standardizes features by subtracting the mean value from the feature and then dividing the result by feature standard deviation.
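A small sketch verifying the formula z = (x - mean) / std against `StandardScaler` on toy numbers (note that sklearn uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual standardization: subtract the mean, divide by the
# (population) standard deviation, matching StandardScaler.
z_manual = (x - x.mean()) / x.std(ddof=0)
z_sklearn = StandardScaler().fit_transform(x)

assert np.allclose(z_manual, z_sklearn)
print(z_sklearn.mean().round(6), z_sklearn.std().round(6))  # ~0.0 and 1.0
```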
balanced_df
balanced_df.iloc[: , :]
balanced_df.iloc[: , 1:]
#Standardising the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
df_standardised = sc.fit_transform( balanced_df.iloc[: , 1:] )
df_standardised
#Adding the column name to standardised dataset
df_standardised = pd.DataFrame(df_standardised,
columns = col[1:])
df_standardised
#Resetting the index
df_standardised = df_standardised.reset_index(drop = True)
#Adding target column to standardised dataset; the labels must come from
#balanced_df (the frame the standardised features were taken from), not raw_df
df_standardised['Diabetes_012'] = balanced_df['Diabetes_012'].reset_index(drop = True)
df_standardised
Case 1 : Train = 90 % Test = 10%
Case 2 : Train = 50 % Test = 50%
X = df_standardised.iloc[:, :-1].values
Y = df_standardised.iloc[:, -1].values
from sklearn.model_selection import train_test_split
#Case 1 with test size 10%
X_train_case1, X_test_case1, Y_train_case1, Y_test_case1 = train_test_split(X, Y, test_size = 0.1, random_state = 0)
#Case 2 with test size 50%
X_train_case2, X_test_case2, Y_train_case2, Y_test_case2 = train_test_split(X, Y, test_size = 0.5, random_state = 0)
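Not used above, but worth noting: passing `stratify=Y` to `train_test_split` keeps the class proportions identical in both splits, which matters for skewed targets. A toy sketch of this optional variant:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 80 zeros and 20 ones (an 80/20 skew).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy)

# Both halves preserve the 80/20 ratio exactly.
print((y_tr == 1).mean(), (y_te == 1).mean())  # 0.2 and 0.2
```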
from sklearn.ensemble import RandomForestClassifier
reg_case1 = RandomForestClassifier(n_estimators=10, random_state=0)
# Case- 1
reg_case1.fit(X_train_case1,Y_train_case1)
# Predicting
Y_pred_RandomForest_train_case1 = reg_case1.predict(X_train_case1)
Y_pred_RandomForest_test_case1 = reg_case1.predict(X_test_case1)
reg_case2 = RandomForestClassifier(n_estimators=10, random_state=0)
# Case- 2
reg_case2.fit(X_train_case2,Y_train_case2)
#Predicting
Y_pred_RandomForest_train_case2 = reg_case2.predict(X_train_case2)
Y_pred_RandomForest_test_case2 = reg_case2.predict(X_test_case2)
from sklearn.neighbors import KNeighborsClassifier
# Case-1
classifier_case1 = KNeighborsClassifier(n_neighbors=5, metric='euclidean', p=2)
classifier_case1.fit(X_train_case1, Y_train_case1)
Y_pred_KNN_train_case1 = classifier_case1.predict(X_train_case1)
Y_pred_KNN_test_case1 = classifier_case1.predict(X_test_case1)
# Case-2
classifier_case2 = KNeighborsClassifier(n_neighbors=5, metric='euclidean', p=2)
classifier_case2.fit(X_train_case2, Y_train_case2)
Y_pred_KNN_train_case2 = classifier_case2.predict(X_train_case2)
Y_pred_KNN_test_case2 = classifier_case2.predict(X_test_case2)
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
def accuracy(y_pred, y_test):
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix : \n", cm, '\n')
    acc = accuracy_score(y_test, y_pred) * 100
    print("Accuracy Score : {0:.2f}%".format(acc))
    print("Precision : {0:.4f}".format(precision_score(y_test, y_pred, average="weighted")))
    print("Recall : {0:.4f}".format(recall_score(y_test, y_pred, average="weighted")))
    print("F1 Score : {0:.4f}".format(f1_score(y_test, y_pred, average="weighted")))
    return acc
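A toy check of what `average="weighted"` means in the metrics above: per-class scores are averaged using class supports as weights (toy labels below, not the project's predictions):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy 2-class labels: class 0 has support 4, class 1 has support 2.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Per-class F1, then a support-weighted average of them.
per_class = f1_score(y_true, y_pred, average=None)
weighted = f1_score(y_true, y_pred, average="weighted")

manual = (4 * per_class[0] + 2 * per_class[1]) / 6
assert np.isclose(weighted, manual)
print(per_class, weighted)  # [0.75, 0.5] and their weighted mean
```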
print("=============================================================\n \
Random Forest | CASE - 1 | Prediction Evaluation metrics\n\
=============================================================\n")
print("++++++++++++++++++\nTest Set Accuracy: \n++++++++++++++++++")
acc_RandomForest_case1 = accuracy(Y_pred_RandomForest_test_case1, Y_test_case1)
print("=============================================================\n \
Random Forest | CASE - 2 | Prediction Evaluation metrics\n\
=============================================================\n")
print("++++++++++++++++++\nTest Set Accuracy: \n++++++++++++++++++")
acc_RandomForest_case2 = accuracy(Y_pred_RandomForest_test_case2, Y_test_case2)
print("=============================================================\n \
KNN | CASE - 1 | Prediction Evaluation metrics\n\
=============================================================\n")
print("++++++++++++++++++\nTest Set Accuracy: \n++++++++++++++++++")
acc_KNN_case1 = accuracy(Y_pred_KNN_test_case1, Y_test_case1)
print("=============================================================\n \
KNN | CASE - 2 | Prediction Evaluation metrics\n\
=============================================================\n")
print("++++++++++++++++++\nTest Set Accuracy: \n++++++++++++++++++")
acc_KNN_case2 = accuracy(Y_pred_KNN_test_case2, Y_test_case2)
Model
The KNN model is slightly more accurate than the Random Forest model in this diabetes classification task.
Test Case
The Case-1 split (90% train / 10% test) yields slightly higher test accuracy than the Case-2 split (50% train / 50% test).
warnings.filterwarnings('ignore')
!pip install tabulate
from tabulate import tabulate
data = [
["Random Forest Model", \
"{0:.2f}%".format(accuracy_score(Y_train_case1, Y_pred_RandomForest_train_case1)*100), \
"{0:.2f}%".format(accuracy_score(Y_test_case1, Y_pred_RandomForest_test_case1)*100), \
"{0:.2f}%".format(accuracy_score(Y_train_case2, Y_pred_RandomForest_train_case2)*100), \
"{0:.2f}%".format(accuracy_score(Y_test_case2, Y_pred_RandomForest_test_case2)*100) \
],
["KNN Model", \
"{0:.2f}%".format(accuracy_score(Y_train_case1, Y_pred_KNN_train_case1)*100), \
"{0:.2f}%".format(accuracy_score(Y_test_case1, Y_pred_KNN_test_case1)*100), \
"{0:.2f}%".format(accuracy_score(Y_train_case2, Y_pred_KNN_train_case2)*100), \
"{0:.2f}%".format(accuracy_score(Y_test_case2, Y_pred_KNN_test_case2)*100)
]
]
header = ["Name of the Model", \
"Accuracy for Train Case-1", \
"Accuracy for Test Case-1", \
"Accuracy for Train Case-2", \
"Accuracy for Test Case-2"]
print(tabulate(data, headers = header, tablefmt="rst"))
Random Forest
Case 1 (Train = 90%, Test = 10%): training accuracy is 90.91% and test accuracy is 78.68%. The large gap suggests the Case-1 model is overfit.
Case 2 (Train = 50%, Test = 50%): training accuracy is 92.42% and test accuracy is 77.96%. The large gap suggests the Case-2 model is overfit.
KNN
Case 1 (Train = 90%, Test = 10%): training accuracy is 82.72% and test accuracy is 80.32%. The gap is small, so the Case-1 model is only mildly overfit.
Case 2 (Train = 50%, Test = 50%): training accuracy is 82.88% and test accuracy is 80.25%. The gap is small, so the Case-2 model is only mildly overfit.